\section{Competition Details}
The KDD Cup \cite{kddwebsite} is a competition that challenges participants with a data mining problem. The 2009 competition poses three binary classification problems based on the marketing database of the French telecommunications provider Orange Telecom. The three problems are as follows:
\begin{itemize}
\item \textbf{Churn} - Churn rate is also sometimes called attrition rate. It is one of two primary factors that determine the steady-state level of customers a business will support. In its broadest sense, churn rate is a measure of the number of individuals or items moving into or out of a collection over a specific period of time.
\item \textbf{Appetency} - Appetency is the propensity to buy a service or a product.
\item \textbf{Upselling} - Up-selling can imply selling something additional, or selling something that is more profitable or otherwise preferable for the seller instead of the original sale. 
\end{itemize}
\subsection{About the Dataset}
Two datasets were made available for the competition: a large dataset consisting of 50000 train and test instances with 15000 attributes, and a small dataset consisting of 50000 train and test instances with 230 attributes. Both datasets contain a mix of numerical and categorical attributes. The large dataset has 14740 numerical attributes and 260 categorical attributes; the small dataset has 190 numerical attributes and 40 categorical attributes. Our work is based on the small dataset.
\subsubsection{Data Anonymity}
In order to protect the privacy of the consumers, the complete dataset, the attribute names, and the attribute values have been anonymized. As a result, the data takes the following form, with no semantic information about the attributes or their specific values. This was a significant hindrance to our work on this project: without any knowledge of what the attributes mean, it was difficult to develop intuition or heuristics for the approaches we attempted.

\begin{table} 
\caption{Attribute Anonymization}
\centering
\begin{tabular}{|c|c|c|c|}
\hline 
Var 1 & Var 2 & ... & Var 230\tabularnewline
\hline
12.4 & 13.2 & ... & 4ef3b4\tabularnewline
\hline
\end{tabular}
\end{table}

\subsubsection{Missing Values}
The dataset contains missing values for many instances. On average the training dataset has about 159 missing values, with a range of $[0, 50000]$.

\subsubsection{Skewness of the Dataset}

The training dataset provided is extremely skewed towards negative samples. 
\begin{itemize}
\item Churn consists of about 93\% negative samples.
\item Appetency consists of about 98\% negative samples.
\item Upselling consists of about 93\% negative samples.
\end{itemize}

\subsubsection{Usage of Test Data}
The test labels were not released by the competition organizers, and results could only be evaluated through an online submission system. The system was no longer available after the competition closing date of May 11$^{\textrm{th}}$; hence all our results were calculated using ten-fold cross-validation~\cite{cross}.

\subsection{Evaluation}
The main objective of the challenge is to make good predictions of the target variables. The prediction of each target variable is thought of as a separate classification problem. The results of classification, obtained by thresholding the prediction score, may be represented in a confusion matrix, where tp (true positive), fn (false negative), tn (true negative) and fp (false positive) represent the number of examples falling into each possible outcome:

\begin{table} 
\caption{Confusion Matrix}
\centering
\begin{tabular}{|c|c|c|c|}
\hline 
\multicolumn{2}{|c|}{} & \multicolumn{2}{c|}{Prediction} \\
\hline
 &  & Class $+1$ & Class $-1$ \\
\hline 
\multirow{2}{*}{Truth} & Class $+1$ & True Positive (tp) & False Negative (fn) \\ 
 & Class $-1$ & False Positive (fp) & True Negative (tn) \\
\hline
\end{tabular} 
\end{table}
Based on the matrix above, sensitivity and specificity are defined as follows:
\[
\mathrm{Sensitivity}=\frac{tp}{tp+fn}, \qquad \mathrm{Specificity}=\frac{tn}{tn+fp}
\]
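
As an illustration, consider hypothetical counts (chosen for this example only, not taken from the competition data) for 1000 examples of which 70 are positive: $tp=35$, $fn=35$, $tn=837$ and $fp=93$. Then
\[
\mathrm{Sensitivity}=\frac{35}{35+35}=0.5, \qquad \mathrm{Specificity}=\frac{837}{837+93}=0.9 .
\]
The overall accuracy in this example is $(35+837)/1000=0.872$, yet the sensitivity shows that only half of the positive examples are detected.
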
\subsubsection{AUC}
The results are evaluated using the Area Under the Curve (AUC). It corresponds to the area under the curve obtained by plotting sensitivity against specificity while varying a threshold on the prediction values to determine the classification result. The AUC is related to the area under the lift curve and to the Gini index used in marketing ($\mathrm{Gini}=2\,\mathrm{AUC}-1$). The AUC is calculated using the trapezoid method. When binary scores are supplied for the classification instead of discriminant values, the curve is given by the points $\{(0,1),\ (tn/(tn+fp),\ tp/(tp+fn)),\ (1,0)\}$ and the AUC is just the Balanced Accuracy (BAC):
\[
\mathrm{BAC}=\frac{\mathrm{Sensitivity}+\mathrm{Specificity}}{2}
\]
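
The equality between the AUC and the BAC for binary scores can be checked with a short calculation. This is a sketch that assumes specificity is on the horizontal axis and sensitivity on the vertical axis, as the three points above suggest. Writing $p$ for specificity and $s$ for sensitivity (shorthand introduced only for this sketch), the trapezoid method splits the area at $p$:
\[
\mathrm{AUC}=\frac{(1+s)\,p}{2}+\frac{(s+0)\,(1-p)}{2}=\frac{p+sp+s-sp}{2}=\frac{s+p}{2}.
\]
For instance, a classifier that labels every example as negative has $s=0$ and $p=1$, giving an AUC of $0.5$ regardless of how skewed the classes are.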
